INTERSPEECH.2005 - Speech Processing

Total: 54

#1 Supergaussian GARCH models for speech signals

Author: Israel Cohen

In this paper, we introduce super-Gaussian generalized autoregressive conditional heteroscedasticity (GARCH) models for speech signals in the short-time Fourier transform (STFT) domain. We address the problem of speech enhancement, and show that estimating the variances of the STFT expansion coefficients based on GARCH models yields higher speech quality than the decision-directed method, whether the fidelity criterion is minimum mean-squared error (MMSE) of the spectral coefficients or MMSE of the log-spectral amplitude (LSA). Furthermore, while a Gaussian model is inferior to Gamma and Laplacian models when the variances are estimated by the decision-directed method, the Gaussian model is superior when the GARCH modeling method is used. This facilitates MMSE-LSA estimation while taking the heavy-tailed distribution into consideration.
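
For readers unfamiliar with GARCH modeling, the sketch below shows a plain GARCH(1,1)-style conditional-variance recursion applied along the time trajectory of each STFT bin. It is only an illustration of the modeling idea with hypothetical parameters; the paper's actual estimator propagates variances through estimates of the clean-speech coefficients rather than the raw noisy observations.

```python
import numpy as np

def garch_variance_track(stft_frames, omega=1e-3, alpha=0.1, beta=0.85):
    """Illustrative GARCH(1,1)-style recursion for the conditional variance
    of STFT coefficients along time, per frequency bin (hypothetical parameters)."""
    n_frames, n_bins = stft_frames.shape
    # start from the unconditional variance of the GARCH(1,1) process
    sigma2 = np.full(n_bins, omega / max(1.0 - alpha - beta, 1e-6))
    variances = np.empty((n_frames, n_bins))
    for t in range(n_frames):
        variances[t] = sigma2
        # next-frame conditional variance driven by the current squared magnitude
        sigma2 = omega + alpha * np.abs(stft_frames[t]) ** 2 + beta * sigma2
    return variances
```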

#2 A spectral conversion approach to feature denoising and speech enhancement

Authors: A. Mouchtaris ; J. Van der Spiegel ; P. Mueller ; P. Tsakalides

In this paper we demonstrate that spectral conversion can be successfully applied to the speech enhancement problem as a feature denoising method. The enhanced spectral features can be used in the context of the Kalman filter for estimating the clean speech signal. In essence, instead of estimating the clean speech features and the clean speech signal using the iterative Kalman filter, we show that it is more efficient to initially estimate the clean speech features from the noisy speech features using spectral conversion (using a training speech corpus) and then apply the standard Kalman filter. Our results show an average improvement over the iterative Kalman filter that can reach 6 dB in average segmental output signal-to-noise ratio (SNR) at low input SNRs.

#3 Acoustic feedback cancellation in speech reinforcement systems for vehicles

Authors: Alfonso Ortega ; Eduardo Lleida ; Enrique Masgrau ; Luis Buera ; Antonio Miguel

Communication among the passengers of a car can be improved by using a speech reinforcement system. This system picks up the speech of each passenger, amplifies it and plays it back into the cabin through the loudspeakers of the car. Due to the electro-acoustic coupling between loudspeakers and microphones, a closed-loop system is created. To avoid the risk of instability due to the acoustic feedback, acoustic echo cancellation must be performed. Using the Minimum Mean Square Error (MMSE) criterion to adapt the filter, which is very common in acoustic echo cancellation, leads to inaccurate estimates of the Loudspeaker-Enclosure-Microphone (LEM) path due to the closed-loop operation of the system. In this paper, the solution obtained with the MMSE criterion for a Finite-length Impulse Response (FIR) causal adaptive filter is derived, showing that the identification error depends on the amplification factor of the system, the delay of the loop and the spectral characteristics of the excitation signal. The use of whitening filters is proposed and justified to improve the acoustic echo cancellation in speech reinforcement systems for cars. Results obtained for a one-channel speech reinforcement system are presented.
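
As a point of reference, the sketch below shows a standard NLMS adaptive FIR echo canceller with an optional whitening pre-filter applied to both the loudspeaker and microphone signals before adaptation. It illustrates the general structure only; the closed-loop analysis and the specific whitening-filter design of the paper are not reproduced, and the parameter values are hypothetical.

```python
import numpy as np

def nlms_echo_canceller(mic, loudspeaker, taps=256, mu=0.5, eps=1e-8, whiten=None):
    """Minimal NLMS adaptive FIR echo canceller (sketch). If `whiten` is given,
    this FIR pre-filter is applied to both signals before adaptation, in the
    spirit of the whitening filters discussed in the paper."""
    if whiten is not None:
        mic = np.convolve(mic, whiten)[: len(mic)]
        loudspeaker = np.convolve(loudspeaker, whiten)[: len(loudspeaker)]
    w = np.zeros(taps)                          # LEM-path estimate
    err = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = loudspeaker[n - taps:n][::-1]       # most recent loudspeaker samples
        y_hat = w @ x                           # estimated echo
        err[n] = mic[n] - y_hat                 # echo-cancelled microphone signal
        w += mu * err[n] * x / (x @ x + eps)    # NLMS update
    return w, err
```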

#4 Implicit control of noise canceller for speech enhancement

Authors: Julien Bourgeois ; Jürgen Freudenberger ; Guillaume Lathoud

Widrow's interference canceller, adapted by the normalized LMS (NLMS), is a standard approach for separating signals from multiple speakers, for example from the driver (target) and the codriver (interferer) in a car. In practice, the adaptation must be carried out only when the interferer is dominant, i.e. only when some estimate of the signal-to-interference ratio (SIR) is below a certain threshold. In this paper, we present the implicitly controlled LMS (ILMS), a modification of the NLMS. ILMS adaptation is performed continuously using a variable step size, whose design implicitly detects dominance of the interferer over target activity. Specific measures are taken to guarantee stability during adaptation. A theoretical analysis of the ILMS transient convergence and stability conditions shows a significant improvement with respect to the original NLMS. Experimental results on real in-car data confirm the predicted behavior.
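
To make the idea concrete, here is a minimal variable step-size interference canceller in which the step size is scaled by a crude power-ratio heuristic, so adaptation speeds up when the reference (interferer) channel dominates the canceller output. This heuristic is an assumed stand-in for the ILMS weighting rule, which the paper derives explicitly; all constants are hypothetical.

```python
import numpy as np

def variable_step_canceller(primary, reference, taps=128, mu_max=0.5, eps=1e-8):
    """Variable step-size interference canceller (illustrative sketch only)."""
    w = np.zeros(taps)
    out = np.zeros(len(primary))
    p_ref = p_out = eps
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]
        out[n] = primary[n] - w @ x
        # smoothed power estimates of reference and canceller output
        p_ref = 0.99 * p_ref + 0.01 * reference[n] ** 2
        p_out = 0.99 * p_out + 0.01 * out[n] ** 2
        mu = mu_max * p_ref / (p_ref + p_out)   # larger when the interferer dominates
        w += mu * out[n] * x / (x @ x + eps)
    return w, out
```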

#5 Speech enhancement using Markov model of speech segments

Authors: T. M. Sunil Kumar ; T. V. Sreenivas

It has been shown that Iterative Wiener Filtering (IWF) requires both intra-frame and inter-frame constraints to ensure that the enhanced speech spectra possess natural characteristics of speech. One automated way to apply the intra-frame constraints is Codebook Constrained Wiener Filtering (CCWF). In the present work, we propose a new method of imposing the inter-frame constraints based on Markov modeling of speech segments. We show that the proposed method improves both the average segmental log-likelihood ratio and the average segmental SNR of the enhanced speech, even at SNRs below 0 dB.

#6 A wavelet based noise reduction algorithm for speech signal corrupted by coloured noise

Authors: Vladimir Braquet ; Takao Kobayashi

In this paper, we present a node-dependent wavelet thresholding approach for removing strongly coloured noise from speech signals. The noise power in each node is first estimated using a recursive method. Given the voiced or unvoiced nature of the frame, the signal is expanded onto a predefined best basis. Then an infinitely smooth soft threshold is applied to each node of the decomposition tree. Finally, the estimated clean signal is reconstructed. Experimental results on a Japanese database, for various coloured noises, demonstrate the effectiveness of the proposed method, even at low SNR. Compared with the common level-dependent method, this algorithm provides better denoising results.
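
The sketch below illustrates node-dependent soft thresholding over a wavelet packet decomposition using PyWavelets. It is a simplified stand-in: the paper's recursive noise tracking, best-basis selection and infinitely smooth threshold function are replaced here by a per-node median-absolute-deviation noise estimate and the plain universal threshold.

```python
import numpy as np
import pywt

def node_dependent_denoise(noisy, wavelet="db8", level=4):
    """Node-dependent soft thresholding over a wavelet packet tree (sketch)."""
    wp = pywt.WaveletPacket(noisy, wavelet=wavelet, maxlevel=level)
    for node in wp.get_level(level, order="natural"):
        coeffs = node.data
        sigma = np.median(np.abs(coeffs)) / 0.6745                 # robust per-node noise estimate
        thr = sigma * np.sqrt(2.0 * np.log(max(len(coeffs), 2)))   # universal threshold
        node.data = pywt.threshold(coeffs, thr, mode="soft")
    return wp.reconstruct(update=True)[: len(noisy)]
```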

#7 Speech enhancement in temporal DFT trajectories using Kalman filters

Authors: Esfandiar Zavarehei ; Saeed Vaseghi

In this paper a time-frequency estimator for the enhancement of noisy speech signals in the DFT domain is introduced. This estimator is based on modelling and filtering the temporal trajectories of the DFT components of the noisy speech signal using Kalman filters. The time-varying trajectory of the DFT components of speech is modelled by a low-order autoregressive process incorporated in the state equation of the Kalman filter. The performance of the proposed method for the enhancement of noisy speech is evaluated and compared with the MMSE log-STSA estimator and parametric spectral subtraction. Evaluation results show that the incorporation of temporal information through Kalman filters results in reduced residual noise and improved perceived quality of speech.
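
A minimal sketch of the underlying idea follows: each DFT bin's trajectory over time is modelled as a low-order AR process in companion form and tracked with a standard Kalman filter. The AR coefficients and noise variances are assumed to be estimated elsewhere, and the paper's specific parameter-estimation and smoothing details are not reproduced.

```python
import numpy as np

def kalman_dft_track(noisy_bin, ar_coeffs, q_var, r_var):
    """Kalman filtering of one DFT-bin trajectory (illustrative sketch).
    `ar_coeffs` are AR model coefficients, `q_var` the driving-noise variance
    and `r_var` the observation-noise variance, all assumed known here."""
    p = len(ar_coeffs)
    # companion-form state transition for an AR(p) trajectory model
    F = np.zeros((p, p), dtype=complex)
    F[0, :] = ar_coeffs
    F[1:, :-1] = np.eye(p - 1)
    H = np.zeros((1, p), dtype=complex); H[0, 0] = 1.0
    Q = np.zeros((p, p)); Q[0, 0] = q_var
    x = np.zeros((p, 1), dtype=complex)
    P = np.eye(p) * r_var
    enhanced = np.empty(len(noisy_bin), dtype=complex)
    for t, y in enumerate(noisy_bin):
        # predict
        x = F @ x
        P = F @ P @ F.conj().T + Q
        # update with the noisy DFT observation
        S = (H @ P @ H.conj().T).real[0, 0] + r_var
        K = P @ H.conj().T / S
        x = x + K * (y - (H @ x)[0, 0])
        P = (np.eye(p) - K @ H) @ P
        enhanced[t] = x[0, 0]
    return enhanced
```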

#8 Formant-tracking linear prediction models for speech processing in noisy environments

Authors: Qin Yan ; Saeed Vaseghi ; Esfandiar Zavarehei ; Ben Milner

This paper presents a formant-tracking method for estimating the time-varying trajectories of a linear prediction (LP) model of speech in noise. The main focus of this work is the modelling of the non-stationary temporal trajectories of the formants of speech for improved LP model estimation in noise. The proposed approach provides a systematic framework for modelling the inter-frame correlation of speech parameters across successive frames, while the intra-frame correlations are modelled by the LP parameters. The formant-tracking LP model estimation is composed of two stages: (a) a pre-cleaning intra-frame spectral amplitude estimation stage, where an initial estimate of the magnitude frequency response of the LP model of clean speech is obtained, and (b) an inter-frame signal processing stage, where formant classification and Kalman filters are combined to estimate the trajectories of the formants. The effects of car and train noise on the observation and estimation of formant tracks are investigated, and the average formant-tracking errors at different signal-to-noise ratios (SNRs) are computed. The evaluation results demonstrate that after noise reduction and Kalman filtering the formant-tracking errors are significantly reduced.

#9 Statistical noise compensation for cochlear implant processing

Authors: Hui Jiang ; Qian-Jie Fu

A statistical noise compensation algorithm is proposed for cochlear implant processing to improve cochlear implant patients' speech performance in noise. Using the well-known environmental model for speech in additive noise, the MMSE (minimum mean square error) estimate of the clean speech signal was derived from the noisy speech observation based on a linear approximation of the original nonlinear environmental model. Words-in-sentences recognition by four cochlear implant subjects was tested under different noisy listening conditions (steady white noise and 6-talker speech babble at +15, +10, +5, and 0 dB SNR) with and without the noise compensation algorithm. For steady white noise, a mean improvement of 36% correct in sentence recognition scores was obtained across the SNR levels when the noise compensation algorithm was applied to cochlear implant processing. With the speech babble noise, however, the amount of improvement depended strongly on the SNR level, increasing gradually from 7% to 32% correct as the SNR increased from 0 dB to 15 dB. The results suggest that cochlear implant patients may significantly benefit from the proposed noise compensation algorithm in noisy listening conditions.

#10 WPD-based noise suppression using nonlinearly weighted threshold quantile estimation and optimal wavelet shrinking

Authors: Tuan Van Pham ; Gernot Kubin

A novel speech enhancement system based on wavelet packet decomposition (WPD) is proposed. The noise level is estimated using quantiles in the wavelet domain. To handle colored and non-stationary noise, the universal thresholds are weighted by a time-frequency dependent nonlinear function. Two nonlinear weighting methods, using temporal threshold variation and kernel smoothing, are proposed. The weighted thresholds are smoothed and employed for wavelet shrinking with an adaptive factor to compress noise while preserving speech quality. The proposed system is evaluated and compared with other algorithms based on spectral subtraction via objective measures and subjective tests, demonstrating its superior performance.
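
The snippet below sketches the quantile part of such a scheme: for each subband, a quantile of the coefficient magnitudes over time serves as a noise-floor estimate that is turned into a universal threshold and can then be reshaped by a caller-supplied weighting function. The weighting function and parameter values are hypothetical, and the paper's specific temporal-variation and kernel-smoothing methods are not reproduced.

```python
import numpy as np

def quantile_noise_threshold(coeff_frames, q=0.5, weight=None):
    """Quantile-based noise thresholds per subband (sketch). `coeff_frames`
    is an (n_frames, n_subbands) array of wavelet-packet coefficients."""
    noise_floor = np.quantile(np.abs(coeff_frames), q, axis=0)        # per-band noise-floor estimate
    thr = noise_floor * np.sqrt(2.0 * np.log(coeff_frames.shape[0]))  # universal threshold per band
    if weight is not None:
        thr = weight(thr)   # optional nonlinear, time-frequency dependent weighting (hypothetical)
    return thr
```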

#11 Subjective and objective quality assessment of regression-enhanced speech in real car environments

Authors: Weifeng Li ; Katunobu Itou ; Kazuya Takeda ; Fumitada Itakura

In this paper, we propose a nonlinear regression method for speech enhancement, which approximates the log spectra of clean speech from the log spectra of the noisy speech and of the estimated noise. We compared both subjective and objective assessments of the regression-enhanced speech to those obtained through spectral subtraction (SS) and short-time spectral amplitude (STSA) methods. Our subjective evaluation experiments, which included a Mean Opinion Score (MOS) test and a Pairwise Preference Test (PPT), show that the proposed regression-based speech enhancement method provides consistent improvements in overall quality in all seven driving conditions. The proposed method also performs best on most objective measures.
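
A minimal sketch of such a log-spectral regression is shown below, using a generic multilayer perceptron regressor as a stand-in for the paper's mapping function. The feature layout, network topology and training settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_log_spectral_regressor(noisy_logspec, noise_logspec, clean_logspec):
    """Fit a nonlinear map from [noisy log spectra, estimated noise log spectra]
    to clean log spectra, frame by frame (illustrative settings)."""
    X = np.hstack([noisy_logspec, noise_logspec])     # (n_frames, 2 * n_bins)
    model = MLPRegressor(hidden_layer_sizes=(256,), max_iter=500)
    model.fit(X, clean_logspec)                       # targets: (n_frames, n_bins)
    return model

def enhance(model, noisy_logspec, noise_logspec):
    """Predict enhanced (clean) log spectra for new noisy frames."""
    return model.predict(np.hstack([noisy_logspec, noise_logspec]))
```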

#12 A model for selective segregation of a target instrument sound from the mixed sound of various instruments

Authors: Masashi Unoki ; Masaaki Kubo ; Atsushi Haniu ; Masato Akagi

In this paper, as a first step towards constructing a selective sound segregation model for use in modeling phenomena such as the cocktail party effect, we consider a basic problem of selective sound segregation for instrument sounds using a single-channel method (monaural processing). We propose a model concept for selective sound segregation based on auditory scene analysis and then describe the implementation of a model for segregating a target instrument sound from the mixed sound of various instruments. The proposed model consists of two blocks: a model for segregating two acoustic sources as bottom-up processing, and selective processing based on knowledge sources as top-down processing. Two simulations were carried out to evaluate the proposed model combining bottom-up and top-down processing. The results showed that the model could selectively segregate the target instrument sound from the mixed sound by using prior information, and that using both bottom-up and top-down processing was more effective than using either separately. Since these simulations can be interpreted as representing concurrent vowel segregation in the case of a speech signal, it should be possible to extend the proposed model to a selective speech segregation model.

#13 Improved decision directed approach for speech enhancement using an adaptive time segmentation

Authors: Richard C. Hendriks ; Richard Heusdens ; Jesper Jensen

Short-time Fourier transform (STFT) methods are often used to overcome the degradation of speech signals affected by noise. STFT-gain functions are usually expressed as a function of the a priori SNR, say ξ, and good techniques to estimate ξ are of vital importance for the quality of the enhanced speech. Often, ξ is estimated using the so-called decision-directed (DD) approach. However, the DD approach builds on a number of approximations, where certain expected values of signal-related quantities are approximated by instantaneous estimates. In this paper we present a method to improve these approximations by combining the DD approach with an adaptive time segmentation. Objective and subjective experiments show that the proposed method leads to significant improvements compared to the conventional DD approach. Furthermore, simulation experiments confirm a reduced amount of non-stationary residual noise.
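
For reference, the conventional decision-directed estimate that the paper builds on can be written compactly as below: a weighted combination of the SNR implied by the previous frame's clean-amplitude estimate and the instantaneous a posteriori SNR minus one. This is the standard textbook form, not the adaptive-segmentation refinement proposed in the paper; the smoothing constant and floor are typical but hypothetical values.

```python
import numpy as np

def decision_directed_xi(noisy_mag, noise_var, prev_clean_mag, alpha=0.98, xi_min=1e-3):
    """Classic decision-directed a priori SNR estimate for one frame (sketch)."""
    gamma = (noisy_mag ** 2) / noise_var                      # a posteriori SNR
    xi = alpha * (prev_clean_mag ** 2) / noise_var \
         + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)       # weighted combination
    return np.maximum(xi, xi_min)                             # floor to avoid over-suppression
```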

#14 Generalized filter-bank equalizer for noise reduction with reduced signal delay

Authors: Heinrich W. Lollmann ; Peter Vary

An efficient realization of a low-delay filter-bank, termed the generalized filter-bank equalizer (FBE), is proposed for noise reduction with low signal delay. The FBE is equivalent to a time-domain filter with coefficients adapted in the frequency domain. This filter-bank structure ensures perfect signal reconstruction for a variety of spectral transforms with fewer restrictions than an analysis-synthesis filter-bank (AS FB). A non-uniform frequency resolution can be achieved by an allpass transformation. In this case, the FBE has not only a lower signal delay than the AS FB, but also a lower algorithmic complexity for most parameter configurations. Another advantage of the FBE is the lower number of required delay elements (memory) compared to the AS FB. The noise reduction achieved by means of the FBE is approximately equal to that of the AS FB.

#15 A pitch-based model for separation of reverberant speech

Authors: Nicoleta Roman ; DeLiang Wang

In everyday listening, both background noise and reverberation degrade the speech signal. While monaural speech separation based on periodicity has achieved considerable progress in handling additive noise, little research has been devoted to reverberant scenarios. Reverberation smears the harmonic structure of speech signals, and our evaluations using a pitch-based separation algorithm show that an increase in the room reverberation time causes degradation in performance due to the loss in periodicity for the target signal. We propose a two-stage monaural speech separation system that combines the inverse filtering of the room impulse response corresponding to target location with a pitch-based speech segregation method. As a result of the first processing stage, the harmonicity of a signal arriving from target direction is partially restored while signals arriving from other locations are further smeared, and this leads to improved separation. A systematic evaluation shows that the proposed system results in considerable signal-to-noise ratio gains across different conditions.

#16 On noise gain estimation for HMM-based speech enhancement

Authors: David Y. Zhao ; W. Bastiaan Kleijn

To address the variation of noise level in non-stationary noise signals, we study noise gain estimation for speech enhancement using hidden Markov models (HMM). We consider the noise gain as a stochastic process and approximate its probability density function (PDF) as log-normal. The PDF parameters are estimated for every signal block using the past noisy signal blocks. The approximated PDF is then used in a Bayesian speech estimator minimizing the Bayes risk for a novel cost function that allows for an adjustable level of residual noise. As a more computationally efficient alternative, we also derive the maximum likelihood (ML) estimator, assuming the noise gain to be a deterministic parameter. The performance of the proposed gain-adaptive methods is evaluated and compared to two reference methods. The experimental results show significant improvement under noise conditions with time-varying noise energy.

#17 Speech enhancement using auditory phase opponency model

Authors: Om Deshmukh ; Carol Espy-Wilson

In this work we address the problem of single-channel speech enhancement when the speech is corrupted by additive noise. The model presented here, called the Modified Phase Opponency (MPO) model, is based on the auditory PO model proposed by Carney et al. for the detection of tones in noise. The PO model includes a physiologically realistic mechanism for processing the information in neural discharge times and exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery by using cross-auditory-nerve-fiber coincidence detection for extracting temporal cues. Initial evaluation of the MPO model on speech corrupted by white noise at different SNRs shows that the MPO model is able to enhance the spectral peaks while suppressing the noise-only regions.

#18 A stereo input-output superdirective beamformer for dual channel noise reduction

Authors: Thomas Lotter ; Bastian Sauert ; Peter Vary

This contribution presents a stereo input-output beamformer for dual channel noise reduction. The computationally very efficient beamformer adapts superdirective filter design techniques to binaural input signals to optimally enhance signals from a given spatial direction. The beamformer outputs a stereo enhanced signal and thus preserves the spatial impression. Experiments in a real environment using a dummy head and various speech sources indicate that the proposed algorithm is capable of improving the speech intelligibility significantly.
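
As background for the superdirective design, the sketch below computes classic MVDR-style weights against a spherically diffuse noise field, modelled by a sinc coherence matrix with diagonal loading. This is the textbook superdirective formulation, not the binaural stereo input-output extension proposed in the paper, and the loading constant is a hypothetical choice.

```python
import numpy as np

def superdirective_weights(freqs, mic_positions, look_dir, c=343.0, mu=1e-2):
    """Superdirective (MVDR) filter-and-sum weights per frequency (sketch).
    `mic_positions` is an (M, 3) array, `look_dir` a unit vector, `mu` a
    diagonal-loading constant for robustness."""
    M = len(mic_positions)
    # pairwise microphone distances
    d = np.linalg.norm(mic_positions[:, None, :] - mic_positions[None, :, :], axis=-1)
    # far-field propagation delays for the look direction
    tau = mic_positions @ look_dir / c
    weights = []
    for f in freqs:
        Gamma = np.sinc(2.0 * f * d / c) + mu * np.eye(M)   # diffuse-field coherence (normalized sinc)
        steer = np.exp(-2j * np.pi * f * tau)               # steering vector
        Gi = np.linalg.solve(Gamma, steer)
        weights.append(Gi / (steer.conj() @ Gi))            # MVDR normalisation
    return np.array(weights)   # (n_freqs, M)
```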

#19 Kalman filters for time delay of arrival-based source localization

Authors: Ulrich Klee ; Tobias Gehrig ; John McDonough

In this work, we propose an algorithm for acoustic source localization based on time delay of arrival (TDOA) estimation. In earlier work by other authors, an initial closed-form approximation was first used to estimate the true position of the speaker followed by a Kalman filtering stage to smooth the time series of estimates. In the proposed algorithm, this closed-form approximation is eliminated by employing a Kalman filter to directly update the speaker position estimate based on the observed TDOAs. In particular, the TDOAs comprise the observation associated with an extended Kalman filter whose state corresponds to the speaker position. We tested our algorithm on a data set consisting of seminars held by actual speakers. Our experiments revealed that the proposed algorithm provides source localization accuracy superior to the standard spherical and linear intersection techniques. Moreover, the proposed algorithm, although relying on an iterative optimization scheme, proved efficient enough for real-time operation.
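
To illustrate the core step, the snippet below performs a single extended Kalman filter measurement update in which the observed TDOAs are compared against those predicted from the current position estimate, with the Jacobian of the TDOA model linearizing the update. It is a generic EKF sketch under simple assumptions (static 3-D state, known measurement covariance), not the paper's full tracker.

```python
import numpy as np

def ekf_tdoa_update(x, P, tdoas, mic_pairs, R, c=343.0):
    """One EKF measurement update for TDOA-based source localization (sketch).
    `x` is the current position estimate (3,), `mic_pairs` a list of
    (mic_i, mic_j) coordinate pairs, `tdoas` the observed delays, `R` the
    measurement noise covariance."""
    h = np.empty(len(mic_pairs))
    H = np.empty((len(mic_pairs), 3))
    for k, (mi, mj) in enumerate(mic_pairs):
        di, dj = np.linalg.norm(x - mi), np.linalg.norm(x - mj)
        h[k] = (di - dj) / c                           # predicted TDOA
        H[k] = ((x - mi) / di - (x - mj) / dj) / c     # Jacobian row of the TDOA model
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x_new = x + K @ (tdoas - h)                        # corrected position estimate
    P_new = (np.eye(3) - K @ H) @ P
    return x_new, P_new
```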

#20 Simultaneous adaptation of echo cancellation and spectral subtraction for in-car speech recognition

Authors: Osamu Ichikawa ; Masafumi Nishimura

For noise robustness of in-car speech recognition, most current systems are based on the assumption that the only noise present is stationary cruising noise. Therefore, the recognition rate is greatly reduced when there is music or news coming from a radio or a CD player in the car. Since reference signals are available from such in-vehicle units, there is great hope that echo cancellers can eliminate the echo component in the observed noisy signals. However, previous research reported that the performance of an echo canceller is degraded in very noisy conditions. This implies it is desirable to combine the processes of echo cancellation and noise reduction. In this paper, we propose a system that uses echo cancellation and spectral subtraction simultaneously. A stationary noise component for spectral subtraction is estimated through the adaptation of an echo canceller. In our experiments, this system significantly reduced the errors in automatic speech recognition compared with the conventional combination of echo cancellation and spectral subtraction.

#21 Variable step size adaptive decorrelation filtering for competing speech separation

Authors: Rong Hu ; Yunxin Zhao

Two variable step size (VSS) techniques are proposed for adaptive decorrelation filtering (ADF) to improve the performance of competing speech separation. The first VSS method applies gradient adaptive step-size (GAS) to increase the ADF convergence rate. Under some simplifying assumptions, the GAS technique is generalized to allow combination with additional VSS techniques for the ADF algorithm. The second VSS method is based on error analysis of ADF estimates under a simplified signal model to decrease the steady-state filter error. An integration of both techniques into ADF was tested with TIMIT speech data convolutively mixed by reverberant room impulse responses. Experimental results showed that the proposed algorithm significantly increased the ADF convergence rate and improved gains in both target-to-interference ratio (TIR) and phone recognition accuracy of the target speech.

#22 Speech extraction in a car interior using frequency-domain ICA with rapid filter adaptations

Authors: Daisuke Saitoh ; Atsunobu Kaminuma ; Hiroshi Saruwatari ; Tsuyoki Nishikawa ; Akinobu Lee

This paper describes two new algorithms for blind source separation (BSS) based on frequency-domain independent component analysis (FDICA). One is FDICA with pre-filtering by a speech sub-band passing filter to slow down the learning speed in low signal-to-noise ratio (SNR) sub-bands. The other is FDICA with sub-band selection learning to reduce the number of iterations for those sub-bands. The results of speech recognition experiments show that each method can improve word accuracy by as much as 7% and that the second method can increase the speed by approximately 60%.

#23 Speech enhancement using non-acoustic sensors

Authors: Rongqiang Hu ; Sunil D. Kamath ; David V. Anderson

This paper describes a speech enhancement system that significantly improves the intelligibility of noisy speech in the context of a speech coder in low-SNR conditions. The system uses two state-of-the-art non-acoustic sensors: a general electromagnetic motion sensor (GEMS) that detects the internal motions of the glottis, and a physiological microphone (P-mic) that measures vibrations of the skin associated with speech. Both sensors are relatively immune to ambient acoustic noise, but provide incomplete information about speech. In the proposed system, the strengths of two algorithms, a perceptually motivated constant-Q (CQ) algorithm and an enhanced glottal correlation (GCORR) algorithm, are combined. The CQ algorithm employs a perceptually inspired signal detection technique to estimate the presence of speech cues in low-SNR conditions. To reduce annoying artifacts, a state-dependent mechanism that discriminates the distinct acoustic properties of each phoneme and a psychoacoustic masking model are used to control the enhancement gains. The enhanced glottal correlation algorithm extracts the desired speech signal from the noisy mixture, using a modified speech-GEMS correlation estimate between the speech signal and the glottal waveform supplied by GEMS. Both subjective and objective experiments were performed in a variety of noise conditions to demonstrate the improvement relative to the EMSR algorithm.

#24 Improved blind dereverberation performance by using spatial information

Authors: Marc Delcroix ; Takafumi Hikichi ; Masato Miyoshi

In this paper we consider the numerical problems faced by a blind dereverberation algorithm based on multi-channel linear prediction. One hypothesis frequently incorporated in multi-microphone dereverberation algorithms is that the channels do not share common zeros. However, it is known that real room transfer functions have a large number of zeros close to the unit circle on the z-plane, and thus many zeros are expected to be very close to each other. Consequently, if only a few microphones are used, the channels present numerically overlapping zeros and dereverberation algorithms perform poorly. We study this phenomenon using the previously reported LInear-predictive Multi-input Equalization (LIME) algorithm. Spatial information can be used to deal with the problem of overlapping zeros. We describe the improved dereverberation performance that we achieve by increasing the number of microphones.

#25 A hybrid microphone array post-filter in a diffuse noise field

Authors: Junfeng Li ; Masato Akagi

This paper proposes a hybrid post-filter for microphone arrays with the assumption of a diffuse noise field, where few post-filter performs well, to suppress correlated as well as uncorrelated noises. In the proposed post-filter, a modified Zelinski post-filter is applied to the high frequencies to suppress spatially uncorrelated noise and a single-channel Wiener post-filter is applied to the low frequencies for cancellation of spatially correlated noise. In theory, the proposed post-filter follows the framework of the multi-channel Wiener filter. In practice, experiments using multi-channel noise recordings were conducted and results show that the proposed hybrid post-filter gives the highest SNR improvements and lowest speech distortion among the tested post-filters in various car environments.